You will use a subset of the LendingClub dataset downloaded from Kaggle: https://www.kaggle.com/wordsforthewise/lending-club
LendingClub is a US peer-to-peer lending company headquartered in San Francisco, California.[3] It was the first peer-to-peer lender to register its offerings as securities with the Securities and Exchange Commission (SEC) and to offer loan trading on a secondary market. LendingClub is the world's largest peer-to-peer lending platform.
Given historical data on loans that were issued, including whether or not the borrower repaid, we can build a model that predicts whether a future applicant will repay their loan. That way, when a potential customer arrives, we can assess how likely they are to repay. Keep the classification metrics in mind when evaluating your model's performance!
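As a quick reminder of those metrics (a minimal sketch on made-up labels, not part of this dataset; 1 = repaid, 0 = charged off):

```python
# Toy labels, purely illustrative: 1 = Fully Paid, 0 = Charged Off
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# Count true positives, false positives, and false negatives by hand
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # 3 / 4 = 0.75
recall = tp / (tp + fn)     # 3 / 4 = 0.75
```

In practice you would compute these with scikit-learn's metrics on your test set, but the definitions above are what matters when the classes are imbalanced, as they are here.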
The loan_status column contains the label.
There are many LendingClub datasets on Kaggle. This information corresponds to the dataset you have been provided:
| | LoanStatNew | Description |
|---|---|---|
| 0 | loan_amnt | The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value. |
| 1 | term | The number of payments on the loan. Values are in months and can be either 36 or 60. |
| 2 | int_rate | Interest Rate on the loan |
| 3 | installment | The monthly payment owed by the borrower if the loan originates. |
| 4 | grade | LC assigned loan grade |
| 5 | sub_grade | LC assigned loan subgrade |
| 6 | emp_title | The job title supplied by the Borrower when applying for the loan.* |
| 7 | emp_length | Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years. |
| 8 | home_ownership | The home ownership status provided by the borrower during registration or obtained from the credit report. Our values are: RENT, OWN, MORTGAGE, OTHER |
| 9 | annual_inc | The self-reported annual income provided by the borrower during registration. |
| 10 | verification_status | Indicates if income was verified by LC, not verified, or if the income source was verified |
| 11 | issue_d | The month which the loan was funded |
| 12 | loan_status | Current status of the loan |
| 13 | purpose | A category provided by the borrower for the loan request. |
| 14 | title | The loan title provided by the borrower |
| 15 | zip_code | The first 3 numbers of the zip code provided by the borrower in the loan application. |
| 16 | addr_state | The state provided by the borrower in the loan application |
| 17 | dti | A ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested LC loan, divided by the borrower’s self-reported monthly income. |
| 18 | earliest_cr_line | The month the borrower's earliest reported credit line was opened |
| 19 | open_acc | The number of open credit lines in the borrower's credit file. |
| 20 | pub_rec | Number of derogatory public records |
| 21 | revol_bal | Total credit revolving balance |
| 22 | revol_util | Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit. |
| 23 | total_acc | The total number of credit lines currently in the borrower's credit file |
| 24 | initial_list_status | The initial listing status of the loan. Possible values are – W, F |
| 25 | application_type | Indicates whether the loan is an individual application or a joint application with two co-borrowers |
| 26 | mort_acc | Number of mortgage accounts. |
| 27 | pub_rec_bankruptcies | Number of public record bankruptcies |
import pandas as pd
data_info = pd.read_csv('lending_club_info.csv',index_col = 'LoanStatNew')
print(data_info.loc['revol_util']['Description'])
Revolving line utilization rate, or the amount of credit the borrower is using relative to all available revolving credit.
def feat_info(col_name):
    print(data_info.loc[col_name]['Description'])
feat_info('mort_acc')
Number of mortgage accounts.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# may or may not be needed, depending on your notebook version
%matplotlib inline
df = pd.read_csv('lending_club_loan_two.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 396030 entries, 0 to 396029
Data columns (total 27 columns):
 #   Column                Non-Null Count   Dtype
---  ------                --------------   -----
 0   loan_amnt             396030 non-null  float64
 1   term                  396030 non-null  object
 2   int_rate              396030 non-null  float64
 3   installment           396030 non-null  float64
 4   grade                 396030 non-null  object
 5   sub_grade             396030 non-null  object
 6   emp_title             373103 non-null  object
 7   emp_length            377729 non-null  object
 8   home_ownership        396030 non-null  object
 9   annual_inc            396030 non-null  float64
 10  verification_status   396030 non-null  object
 11  issue_d               396030 non-null  object
 12  loan_status           396030 non-null  object
 13  purpose               396030 non-null  object
 14  title                 394275 non-null  object
 15  dti                   396030 non-null  float64
 16  earliest_cr_line      396030 non-null  object
 17  open_acc              396030 non-null  float64
 18  pub_rec               396030 non-null  float64
 19  revol_bal             396030 non-null  float64
 20  revol_util            395754 non-null  float64
 21  total_acc             396030 non-null  float64
 22  initial_list_status   396030 non-null  object
 23  application_type      396030 non-null  object
 24  mort_acc              358235 non-null  float64
 25  pub_rec_bankruptcies  395495 non-null  float64
 26  address               396030 non-null  object
dtypes: float64(12), object(15)
memory usage: 81.6+ MB
TASK: Since you will be trying to predict loan status, create a count plot as shown below.
df["loan_status"].value_counts().plot(kind='bar', color=["blue","orange"])
plt.xlabel('loan_status')
plt.ylabel('count')
plt.show()
TASK: Create a histogram of the loan_amnt column.
plt.rcParams["figure.figsize"] = (12,4)
plot = plt.hist(x=df["loan_amnt"], bins=40, alpha=0.4, rwidth=0.9)  # rwidth is the bar width as a fraction of the bin width
TASK: Explore the correlation between the continuous variables.
correlation = df.corr()  # on pandas >= 2.0, use df.corr(numeric_only=True)
correlation
| | loan_amnt | int_rate | installment | annual_inc | dti | open_acc | pub_rec | revol_bal | revol_util | total_acc | mort_acc | pub_rec_bankruptcies |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| loan_amnt | 1.000000 | 0.168921 | 0.953929 | 0.336887 | 0.016636 | 0.198556 | -0.077779 | 0.328320 | 0.099911 | 0.223886 | 0.222315 | -0.106539 |
| int_rate | 0.168921 | 1.000000 | 0.162758 | -0.056771 | 0.079038 | 0.011649 | 0.060986 | -0.011280 | 0.293659 | -0.036404 | -0.082583 | 0.057450 |
| installment | 0.953929 | 0.162758 | 1.000000 | 0.330381 | 0.015786 | 0.188973 | -0.067892 | 0.316455 | 0.123915 | 0.202430 | 0.193694 | -0.098628 |
| annual_inc | 0.336887 | -0.056771 | 0.330381 | 1.000000 | -0.081685 | 0.136150 | -0.013720 | 0.299773 | 0.027871 | 0.193023 | 0.236320 | -0.050162 |
| dti | 0.016636 | 0.079038 | 0.015786 | -0.081685 | 1.000000 | 0.136181 | -0.017639 | 0.063571 | 0.088375 | 0.102128 | -0.025439 | -0.014558 |
| open_acc | 0.198556 | 0.011649 | 0.188973 | 0.136150 | 0.136181 | 1.000000 | -0.018392 | 0.221192 | -0.131420 | 0.680728 | 0.109205 | -0.027732 |
| pub_rec | -0.077779 | 0.060986 | -0.067892 | -0.013720 | -0.017639 | -0.018392 | 1.000000 | -0.101664 | -0.075910 | 0.019723 | 0.011552 | 0.699408 |
| revol_bal | 0.328320 | -0.011280 | 0.316455 | 0.299773 | 0.063571 | 0.221192 | -0.101664 | 1.000000 | 0.226346 | 0.191616 | 0.194925 | -0.124532 |
| revol_util | 0.099911 | 0.293659 | 0.123915 | 0.027871 | 0.088375 | -0.131420 | -0.075910 | 0.226346 | 1.000000 | -0.104273 | 0.007514 | -0.086751 |
| total_acc | 0.223886 | -0.036404 | 0.202430 | 0.193023 | 0.102128 | 0.680728 | 0.019723 | 0.191616 | -0.104273 | 1.000000 | 0.381072 | 0.042035 |
| mort_acc | 0.222315 | -0.082583 | 0.193694 | 0.236320 | -0.025439 | 0.109205 | 0.011552 | 0.194925 | 0.007514 | 0.381072 | 1.000000 | 0.027239 |
| pub_rec_bankruptcies | -0.106539 | 0.057450 | -0.098628 | -0.050162 | -0.014558 | -0.027732 | 0.699408 | -0.124532 | -0.086751 | 0.042035 | 0.027239 | 1.000000 |
TASK: Visualize the table above using a heatmap. Depending on your matplotlib version, you may need to adjust the map manually.
# seaborn was already imported above as sns
plt.figure(figsize=(11, 9))
sns.heatmap(correlation, annot=True)
plt.show()
TASK: You should have noticed the near-perfect correlation with the "installment" column. Explore this column further. Print out their descriptions and make a scatter plot between the two. Does this relationship make sense? Could there be duplicated information here?
feat_info('installment')
The monthly payment owed by the borrower if the loan originates.
feat_info('loan_amnt')
The listed amount of the loan applied for by the borrower. If at some point in time, the credit department reduces the loan amount, then it will be reflected in this value.
plt.rcParams["figure.figsize"] = (6,4)
plt.scatter(df.installment, df.loan_amnt, marker = "o", edgecolor="white");
Yes, it makes sense: one is the amount of the loan granted and the other is the monthly payment the borrower makes, so the borrower pays back an amount directly related to what was lent. We do not consider this duplicated information, but the two columns are directly related.
TASK: Create a box-and-whisker plot (boxplot) showing the relationship between loan_status and the loan amount.
sns.boxplot(y='loan_amnt', x='loan_status',
data=df)
<AxesSubplot:xlabel='loan_status', ylabel='loan_amnt'>
TASK: Compute the summary statistics for the loan amount, grouped by loan_status.
df[['loan_amnt','loan_status']].groupby(['loan_status']).describe()
| loan_status | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Charged Off | 77673.0 | 15126.300967 | 8505.090557 | 1000.0 | 8525.0 | 14000.0 | 20000.0 | 40000.0 |
| Fully Paid | 318357.0 | 13866.878771 | 8302.319699 | 500.0 | 7500.0 | 12000.0 | 19225.0 | 40000.0 |
TASK: Explore the Grade and SubGrade columns that LendingClub assigns to the loans. What are their possible values?
df.grade.unique()
array(['B', 'A', 'C', 'E', 'D', 'F', 'G'], dtype=object)
df.sub_grade.unique()
array(['B4', 'B5', 'B3', 'A2', 'C5', 'C3', 'A1', 'B2', 'C1', 'A5', 'E4',
'A4', 'A3', 'D1', 'C2', 'B1', 'D3', 'D5', 'D2', 'E1', 'E2', 'E5',
'F4', 'E3', 'D4', 'G1', 'F5', 'G2', 'C4', 'F1', 'F3', 'G5', 'G4',
'F2', 'G3'], dtype=object)
TASK: Create a count plot by grade. Set the hue to the loan_status label.
plt.rcParams["figure.figsize"] = (6,4)
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="grade", data=df, hue = "loan_status")
TASK: Display a count plot by subgrade. You may need to resize this plot and reorder the x-axis. Feel free to edit the color palette. Explore the loans both by subgrade alone and separated by loan status. After creating this plot, make a similar one but set hue = "loan_status".
plt.rcParams["figure.figsize"] = (12,4)
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="sub_grade", data=df,
order = ["A1","A2","A3","A4","A5","B1","B2","B3","B4","B5",
"C1","C2","C3","C4","C5","D1","D2","D3","D4","D5",
"E1","E2","E3","E4","E5","F1","F2","F3","F4","F5",
"G1","G2","G3","G4","G5"])
plt.rcParams["figure.figsize"] = (12,4)
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="sub_grade", data=df, hue = "loan_status",
order = ["A1","A2","A3","A4","A5","B1","B2","B3","B4","B5",
"C1","C2","C3","C4","C5","D1","D2","D3","D4","D5",
"E1","E2","E3","E4","E5","F1","F2","F3","F4","F5",
"G1","G2","G3","G4","G5"])
TASK: It looks like the F and G subgrades do not get repaid as often. Isolate those and redo the count plot with just those two.
plt.rcParams["figure.figsize"] = (12,4)
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="sub_grade", data=df, hue = "loan_status",
order = ["F1","F2","F3","F4","F5",
"G1","G2","G3","G4","G5"])
TASK: Create a new column called 'loan_repaid' containing a 1 if the loan status was "Fully Paid" and a 0 if it was "Charged Off".
df["loan_repaid"] = (df['loan_status']=="Fully Paid").astype(int)
df[["loan_repaid", "loan_status"]]
| | loan_repaid | loan_status |
|---|---|---|
| 0 | 1 | Fully Paid |
| 1 | 1 | Fully Paid |
| 2 | 1 | Fully Paid |
| 3 | 1 | Fully Paid |
| 4 | 0 | Charged Off |
| ... | ... | ... |
| 396025 | 1 | Fully Paid |
| 396026 | 1 | Fully Paid |
| 396027 | 1 | Fully Paid |
| 396028 | 1 | Fully Paid |
| 396029 | 1 | Fully Paid |
396030 rows × 2 columns
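The same binary label could also be built with .map(); a minimal sketch on a toy Series:

```python
import pandas as pd

# Toy stand-in for df['loan_status']
status = pd.Series(["Fully Paid", "Charged Off", "Fully Paid"])

# Map each status string to its binary label
loan_repaid = status.map({"Fully Paid": 1, "Charged Off": 0})
# loan_repaid -> 1, 0, 1
```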
CHALLENGE TASK: (This is tricky, but it can be done in one line!) Create a bar plot showing the correlation of the numeric variables with the new loan_repaid column. Helpful link
df.corr()[['loan_repaid']].sort_values('loan_repaid')[:-1].plot.bar()
<AxesSubplot:>
Section goal: Remove or fill any missing data. Remove unnecessary or repetitive features. Convert string features to dummy variables.
df.head()
| | loan_amnt | term | int_rate | installment | grade | sub_grade | emp_title | emp_length | home_ownership | annual_inc | ... | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | application_type | mort_acc | pub_rec_bankruptcies | address | loan_repaid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 months | 11.44 | 329.48 | B | B4 | Marketing | 10+ years | RENT | 117000.0 | ... | 0.0 | 36369.0 | 41.8 | 25.0 | w | INDIVIDUAL | 0.0 | 0.0 | 0174 Michelle Gateway\r\nMendozaberg, OK 22690 | 1 |
| 1 | 8000.0 | 36 months | 11.99 | 265.68 | B | B5 | Credit analyst | 4 years | MORTGAGE | 65000.0 | ... | 0.0 | 20131.0 | 53.3 | 27.0 | f | INDIVIDUAL | 3.0 | 0.0 | 1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113 | 1 |
| 2 | 15600.0 | 36 months | 10.49 | 506.97 | B | B3 | Statistician | < 1 year | RENT | 43057.0 | ... | 0.0 | 11987.0 | 92.2 | 26.0 | f | INDIVIDUAL | 0.0 | 0.0 | 87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113 | 1 |
| 3 | 7200.0 | 36 months | 6.49 | 220.65 | A | A2 | Client Advocate | 6 years | RENT | 54000.0 | ... | 0.0 | 5472.0 | 21.5 | 13.0 | f | INDIVIDUAL | 0.0 | 0.0 | 823 Reid Ford\r\nDelacruzside, MA 00813 | 1 |
| 4 | 24375.0 | 60 months | 17.27 | 609.33 | C | C5 | Destiny Management Inc. | 9 years | MORTGAGE | 55000.0 | ... | 0.0 | 24584.0 | 69.8 | 43.0 | f | INDIVIDUAL | 1.0 | 0.0 | 679 Luna Roads\r\nGreggshire, VA 11650 | 0 |
5 rows × 28 columns
Explore the columns with missing data. A variety of factors can be used to decide whether a feature is likely to be useful, and whether its missing data should be kept, dropped, or filled.
TASK: What is the length of the dataframe?
len(df)
396030
TASK: Create a Series showing the total count of missing values per column.
df.isnull().sum()
loan_amnt                   0
term                        0
int_rate                    0
installment                 0
grade                       0
sub_grade                   0
emp_title               22927
emp_length              18301
home_ownership              0
annual_inc                  0
verification_status         0
issue_d                     0
loan_status                 0
purpose                     0
title                    1755
dti                         0
earliest_cr_line            0
open_acc                    0
pub_rec                     0
revol_bal                   0
revol_util                276
total_acc                   0
initial_list_status         0
application_type            0
mort_acc                37795
pub_rec_bankruptcies      535
address                     0
loan_repaid                 0
dtype: int64
TASK: Convert this Series to be in terms of percentage of the total DataFrame.
df.isnull().sum()/len(df)*100
loan_amnt               0.000000
term                    0.000000
int_rate                0.000000
installment             0.000000
grade                   0.000000
sub_grade               0.000000
emp_title               5.789208
emp_length              4.621115
home_ownership          0.000000
annual_inc              0.000000
verification_status     0.000000
issue_d                 0.000000
loan_status             0.000000
purpose                 0.000000
title                   0.443148
dti                     0.000000
earliest_cr_line        0.000000
open_acc                0.000000
pub_rec                 0.000000
revol_bal               0.000000
revol_util              0.069692
total_acc               0.000000
initial_list_status     0.000000
application_type        0.000000
mort_acc                9.543469
pub_rec_bankruptcies    0.135091
address                 0.000000
loan_repaid             0.000000
dtype: float64
TASK: Examine emp_title and emp_length to see whether it is okay to drop them. Print the information for these columns using the feat_info() function defined at the start of this notebook.
feat_info("emp_title")
The job title supplied by the Borrower when applying for the loan.*
feat_info("emp_length")
Employment length in years. Possible values are between 0 and 10 where 0 means less than one year and 10 means ten or more years.
TASK: How many unique job titles are there?
len(df.emp_title.unique())
173106
df_freq = df.groupby('emp_title').size().reset_index(name='Count').rename(columns={'emp_title':'Col_value'})
df_freq.sort_values(by=['Count'],ascending=False)
| | Col_value | Count |
|---|---|---|
| 131646 | Teacher | 4389 |
| 79440 | Manager | 4250 |
| 107851 | Registered Nurse | 1856 |
| 105520 | RN | 1846 |
| 128747 | Supervisor | 1830 |
| ... | ... | ... |
| 62994 | Hy Powered | 1 |
| 62995 | Hy Vee | 1 |
| 62996 | Hy-Vee Food Store | 1 |
| 62997 | HyPro, Inc. | 1 |
| 173104 | License Compliance Investigator | 1 |
173105 rows × 2 columns
TASK: Realistically, there are too many unique job titles to try to convert them to dummy variables. Remove that column.
from quickda.clean_data import *  # third-party helper; a plain pandas drop would work too
df = clean(df, method = 'dropcols', columns = ['emp_title'])
df.head()
| | loan_amnt | term | int_rate | installment | grade | sub_grade | emp_length | home_ownership | annual_inc | verification_status | ... | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | application_type | mort_acc | pub_rec_bankruptcies | address | loan_repaid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 months | 11.44 | 329.48 | B | B4 | 10+ years | RENT | 117000.0 | Not Verified | ... | 0.0 | 36369.0 | 41.8 | 25.0 | w | INDIVIDUAL | 0.0 | 0.0 | 0174 Michelle Gateway\r\nMendozaberg, OK 22690 | 1 |
| 1 | 8000.0 | 36 months | 11.99 | 265.68 | B | B5 | 4 years | MORTGAGE | 65000.0 | Not Verified | ... | 0.0 | 20131.0 | 53.3 | 27.0 | f | INDIVIDUAL | 3.0 | 0.0 | 1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113 | 1 |
| 2 | 15600.0 | 36 months | 10.49 | 506.97 | B | B3 | < 1 year | RENT | 43057.0 | Source Verified | ... | 0.0 | 11987.0 | 92.2 | 26.0 | f | INDIVIDUAL | 0.0 | 0.0 | 87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113 | 1 |
| 3 | 7200.0 | 36 months | 6.49 | 220.65 | A | A2 | 6 years | RENT | 54000.0 | Not Verified | ... | 0.0 | 5472.0 | 21.5 | 13.0 | f | INDIVIDUAL | 0.0 | 0.0 | 823 Reid Ford\r\nDelacruzside, MA 00813 | 1 |
| 4 | 24375.0 | 60 months | 17.27 | 609.33 | C | C5 | 9 years | MORTGAGE | 55000.0 | Verified | ... | 0.0 | 24584.0 | 69.8 | 43.0 | f | INDIVIDUAL | 1.0 | 0.0 | 679 Luna Roads\r\nGreggshire, VA 11650 | 0 |
5 rows × 27 columns
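quickda's clean(..., method='dropcols') should be equivalent to a plain pandas drop, so if quickda is not installed the same result can be had with pandas alone; a sketch on a toy frame:

```python
import pandas as pd

# Toy frame standing in for df; only two columns for illustration
toy = pd.DataFrame({"emp_title": ["Teacher", "Manager"],
                    "loan_amnt": [1000.0, 2000.0]})

# Equivalent of clean(toy, method='dropcols', columns=['emp_title'])
toy = toy.drop("emp_title", axis=1)
```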
TASK: Create a count plot of the emp_length column. Challenge: sort it by the order of the values.
df.emp_length.unique()
array(['10+ years', '4 years', '< 1 year', '6 years', '9 years',
'2 years', '3 years', '8 years', '7 years', '5 years', '1 year',
nan], dtype=object)
orderY = ['< 1 year','1 year','2 years','3 years','4 years','5 years','6 years','7 years','8 years','9 years','10+ years']
plt.rcParams["figure.figsize"] = (12,4)
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="emp_length", data=df,
order = orderY)
TASK: Display the count plot with a hue separating Fully Paid vs Charged Off.
plt.rcParams["figure.figsize"] = (12,4)
sns.set_theme(style="darkgrid")
ax = sns.countplot(x="emp_length", data=df, hue = "loan_status",
order = orderY)
CHALLENGE TASK: This plot still doesn't tell us whether there is a strong relationship between employment length and being charged off. What we want is the percentage of charge-offs per category, essentially showing what percentage of people in each employment category did not pay back their loan. There are many ways to create this Series. Once you have created it, try visualizing it with a bar plot. This may be a bit tricky.
df_freq = df.groupby('emp_length').size().reset_index(name='len').rename(columns={'emp_length':'Col_value'})
df_freq
| | Col_value | len |
|---|---|---|
| 0 | 1 year | 25882 |
| 1 | 10+ years | 126041 |
| 2 | 2 years | 35827 |
| 3 | 3 years | 31665 |
| 4 | 4 years | 23952 |
| 5 | 5 years | 26495 |
| 6 | 6 years | 20841 |
| 7 | 7 years | 20819 |
| 8 | 8 years | 19168 |
| 9 | 9 years | 15314 |
| 10 | < 1 year | 31725 |
df_charged = df[df['loan_status'] == 'Charged Off'].groupby(['emp_length']).size().reset_index(name='len').rename(columns={'emp_length':'Col_value'})
df_charged
| | Col_value | len |
|---|---|---|
| 0 | 1 year | 5154 |
| 1 | 10+ years | 23215 |
| 2 | 2 years | 6924 |
| 3 | 3 years | 6182 |
| 4 | 4 years | 4608 |
| 5 | 5 years | 5092 |
| 6 | 6 years | 3943 |
| 7 | 7 years | 4055 |
| 8 | 8 years | 3829 |
| 9 | 9 years | 3070 |
| 10 | < 1 year | 6563 |
df_freq["len"] = df_charged["len"]/df_freq["len"]  # indexes align: both groupbys yield the same sorted categories
df_freq
| | Col_value | len |
|---|---|---|
| 0 | 1 year | 0.199135 |
| 1 | 10+ years | 0.184186 |
| 2 | 2 years | 0.193262 |
| 3 | 3 years | 0.195231 |
| 4 | 4 years | 0.192385 |
| 5 | 5 years | 0.192187 |
| 6 | 6 years | 0.189194 |
| 7 | 7 years | 0.194774 |
| 8 | 8 years | 0.199760 |
| 9 | 9 years | 0.200470 |
| 10 | < 1 year | 0.206872 |
plt.rcParams["figure.figsize"] = (6,4)
df_freq.plot(x="Col_value", y="len", kind="bar")  # label the bars by employment-length category
<AxesSubplot:>
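The per-category charge-off rate above can also be computed in a single groupby, using the fact that the mean of a boolean mask is a proportion; a sketch on toy data:

```python
import pandas as pd

# Toy stand-in for df[['emp_length', 'loan_status']]
toy = pd.DataFrame({
    "emp_length": ["1 year", "1 year", "2 years", "2 years"],
    "loan_status": ["Charged Off", "Fully Paid", "Fully Paid", "Fully Paid"],
})

# Fraction of charged-off loans per employment-length category
rate = toy.groupby("emp_length")["loan_status"].apply(lambda s: (s == "Charged Off").mean())
# rate["1 year"] -> 0.5, rate["2 years"] -> 0.0
```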
TASK: The charge-off rates are extremely similar across all employment lengths. Go ahead and drop the emp_length column.
df = clean(df, method = 'dropcols', columns = ['emp_length'])
df.head()
| | loan_amnt | term | int_rate | installment | grade | sub_grade | home_ownership | annual_inc | verification_status | issue_d | ... | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | application_type | mort_acc | pub_rec_bankruptcies | address | loan_repaid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 months | 11.44 | 329.48 | B | B4 | RENT | 117000.0 | Not Verified | Jan-2015 | ... | 0.0 | 36369.0 | 41.8 | 25.0 | w | INDIVIDUAL | 0.0 | 0.0 | 0174 Michelle Gateway\r\nMendozaberg, OK 22690 | 1 |
| 1 | 8000.0 | 36 months | 11.99 | 265.68 | B | B5 | MORTGAGE | 65000.0 | Not Verified | Jan-2015 | ... | 0.0 | 20131.0 | 53.3 | 27.0 | f | INDIVIDUAL | 3.0 | 0.0 | 1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113 | 1 |
| 2 | 15600.0 | 36 months | 10.49 | 506.97 | B | B3 | RENT | 43057.0 | Source Verified | Jan-2015 | ... | 0.0 | 11987.0 | 92.2 | 26.0 | f | INDIVIDUAL | 0.0 | 0.0 | 87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113 | 1 |
| 3 | 7200.0 | 36 months | 6.49 | 220.65 | A | A2 | RENT | 54000.0 | Not Verified | Nov-2014 | ... | 0.0 | 5472.0 | 21.5 | 13.0 | f | INDIVIDUAL | 0.0 | 0.0 | 823 Reid Ford\r\nDelacruzside, MA 00813 | 1 |
| 4 | 24375.0 | 60 months | 17.27 | 609.33 | C | C5 | MORTGAGE | 55000.0 | Verified | Apr-2013 | ... | 0.0 | 24584.0 | 69.8 | 43.0 | f | INDIVIDUAL | 1.0 | 0.0 | 679 Luna Roads\r\nGreggshire, VA 11650 | 0 |
5 rows × 26 columns
TASK: Revisit the DataFrame to see which columns still have missing data.
df.isnull().sum()
loan_amnt                   0
term                        0
int_rate                    0
installment                 0
grade                       0
sub_grade                   0
home_ownership              0
annual_inc                  0
verification_status         0
issue_d                     0
loan_status                 0
purpose                     0
title                    1755
dti                         0
earliest_cr_line            0
open_acc                    0
pub_rec                     0
revol_bal                   0
revol_util                276
total_acc                   0
initial_list_status         0
application_type            0
mort_acc                37795
pub_rec_bankruptcies      535
address                     0
loan_repaid                 0
dtype: int64
TASK: Review the title column vs the purpose column. Is this repeated information?
df['title'].head(10)
0                   Vacation
1         Debt consolidation
2    Credit card refinancing
3    Credit card refinancing
4      Credit Card Refinance
5         Debt consolidation
6           Home improvement
7       No More Credit Cards
8         Debt consolidation
9         Debt Consolidation
Name: title, dtype: object
df['purpose'].head(10)
0              vacation
1    debt_consolidation
2           credit_card
3           credit_card
4           credit_card
5    debt_consolidation
6      home_improvement
7           credit_card
8    debt_consolidation
9    debt_consolidation
Name: purpose, dtype: object
It is repeated information; however, purpose has no missing values.
TASK: The title column is simply a string subcategory/description of the purpose column. Drop the title column.
df = clean(df, method = 'dropcols', columns = ['title'])
df.head()
| | loan_amnt | term | int_rate | installment | grade | sub_grade | home_ownership | annual_inc | verification_status | issue_d | ... | pub_rec | revol_bal | revol_util | total_acc | initial_list_status | application_type | mort_acc | pub_rec_bankruptcies | address | loan_repaid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 months | 11.44 | 329.48 | B | B4 | RENT | 117000.0 | Not Verified | Jan-2015 | ... | 0.0 | 36369.0 | 41.8 | 25.0 | w | INDIVIDUAL | 0.0 | 0.0 | 0174 Michelle Gateway\r\nMendozaberg, OK 22690 | 1 |
| 1 | 8000.0 | 36 months | 11.99 | 265.68 | B | B5 | MORTGAGE | 65000.0 | Not Verified | Jan-2015 | ... | 0.0 | 20131.0 | 53.3 | 27.0 | f | INDIVIDUAL | 3.0 | 0.0 | 1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113 | 1 |
| 2 | 15600.0 | 36 months | 10.49 | 506.97 | B | B3 | RENT | 43057.0 | Source Verified | Jan-2015 | ... | 0.0 | 11987.0 | 92.2 | 26.0 | f | INDIVIDUAL | 0.0 | 0.0 | 87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113 | 1 |
| 3 | 7200.0 | 36 months | 6.49 | 220.65 | A | A2 | RENT | 54000.0 | Not Verified | Nov-2014 | ... | 0.0 | 5472.0 | 21.5 | 13.0 | f | INDIVIDUAL | 0.0 | 0.0 | 823 Reid Ford\r\nDelacruzside, MA 00813 | 1 |
| 4 | 24375.0 | 60 months | 17.27 | 609.33 | C | C5 | MORTGAGE | 55000.0 | Verified | Apr-2013 | ... | 0.0 | 24584.0 | 69.8 | 43.0 | f | INDIVIDUAL | 1.0 | 0.0 | 679 Luna Roads\r\nGreggshire, VA 11650 | 0 |
5 rows × 25 columns
NOTE: This is one of the hardest parts of the lab! Feel free to fill or drop the missing values of the mort_acc column however you feel most comfortable! Here we are following one very specific approach.
TASK: Find out what the mort_acc column represents.
feat_info('mort_acc')
Number of mortgage accounts.
TASK: Create a value count of the mort_acc column.
df_freq = df.groupby('mort_acc').size().reset_index(name='Count')
df_freq.sort_values(by=['Count'],ascending=False)
| | mort_acc | Count |
|---|---|---|
| 0 | 0.0 | 139777 |
| 1 | 1.0 | 60416 |
| 2 | 2.0 | 49948 |
| 3 | 3.0 | 38049 |
| 4 | 4.0 | 27887 |
| 5 | 5.0 | 18194 |
| 6 | 6.0 | 11069 |
| 7 | 7.0 | 6052 |
| 8 | 8.0 | 3121 |
| 9 | 9.0 | 1656 |
| 10 | 10.0 | 865 |
| 11 | 11.0 | 479 |
| 12 | 12.0 | 264 |
| 13 | 13.0 | 146 |
| 14 | 14.0 | 107 |
| 15 | 15.0 | 61 |
| 16 | 16.0 | 37 |
| 17 | 17.0 | 22 |
| 18 | 18.0 | 18 |
| 19 | 19.0 | 15 |
| 20 | 20.0 | 13 |
| 24 | 24.0 | 10 |
| 22 | 22.0 | 7 |
| 21 | 21.0 | 4 |
| 25 | 25.0 | 4 |
| 27 | 27.0 | 3 |
| 23 | 23.0 | 2 |
| 26 | 26.0 | 2 |
| 30 | 31.0 | 2 |
| 31 | 32.0 | 2 |
| 28 | 28.0 | 1 |
| 29 | 30.0 | 1 |
| 32 | 34.0 | 1 |
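A more direct route to the same counts is value_counts(); a sketch on a toy Series:

```python
import pandas as pd

# Toy stand-in for df['mort_acc']
toy = pd.Series([0.0, 1.0, 0.0, 2.0, 0.0], name="mort_acc")

# value_counts() returns counts sorted from most to least frequent
counts = toy.value_counts()
# counts[0.0] -> 3, counts[1.0] -> 1, counts[2.0] -> 1
```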
TASK: There are many ways to deal with missing data. You could try to build a simple model to fill it in, such as a linear model; fill it in based on the mean of the other columns; or bin the columns into categories and treat NaN as its own category. There is no 100% correct approach! Let's look at the other columns to see which one(s) correlate most highly with mort_acc.
df.corr()[['mort_acc']].sort_values(by=['mort_acc'])
| | mort_acc |
|---|---|
| int_rate | -0.082583 |
| dti | -0.025439 |
| revol_util | 0.007514 |
| pub_rec | 0.011552 |
| pub_rec_bankruptcies | 0.027239 |
| loan_repaid | 0.073111 |
| open_acc | 0.109205 |
| installment | 0.193694 |
| revol_bal | 0.194925 |
| loan_amnt | 0.222315 |
| annual_inc | 0.236320 |
| total_acc | 0.381072 |
| mort_acc | 1.000000 |
TASK: It looks like the total_acc column correlates with the mort_acc column, which makes sense! Try the following fillna() approach: group the dataframe by the total_acc column and compute the mean of mort_acc for each value of total_acc, to obtain the results below:
df_freq = df.groupby('total_acc')['mort_acc'].mean().reset_index(name='Count')
df_freq.sort_values(by=['total_acc'])
| | total_acc | Count |
|---|---|---|
| 0 | 2.0 | 0.000000 |
| 1 | 3.0 | 0.052023 |
| 2 | 4.0 | 0.066743 |
| 3 | 5.0 | 0.103289 |
| 4 | 6.0 | 0.151293 |
| ... | ... | ... |
| 113 | 124.0 | 1.000000 |
| 114 | 129.0 | 1.000000 |
| 115 | 135.0 | 3.000000 |
| 116 | 150.0 | 2.000000 |
| 117 | 151.0 | 0.000000 |
118 rows × 2 columns
CHALLENGE TASK: Fill the missing values of the mort_acc column based on the value of total_acc. If mort_acc is missing, fill it with the mean corresponding to its total_acc value from the Series created above. This involves using the .apply() method with two columns; see the link below for more information.
df['mort_acc'] = df.groupby('total_acc')['mort_acc'].transform(
lambda grp: grp.fillna(np.mean(grp))
)
df.isnull().sum()
loan_amnt                 0
term                      0
int_rate                  0
installment               0
grade                     0
sub_grade                 0
home_ownership            0
annual_inc                0
verification_status       0
issue_d                   0
loan_status               0
purpose                   0
dti                       0
earliest_cr_line          0
open_acc                  0
pub_rec                   0
revol_bal                 0
revol_util              276
total_acc                 0
initial_list_status       0
application_type          0
mort_acc                  0
pub_rec_bankruptcies    535
address                   0
loan_repaid               0
dtype: int64
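The task hints at .apply() over two columns; the cell above used groupby().transform() instead, which gives the same result here. A sketch of the .apply() approach on toy data:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df[['total_acc', 'mort_acc']], with two missing mort_acc values
toy = pd.DataFrame({"total_acc": [3.0, 3.0, 5.0, 5.0],
                    "mort_acc":  [1.0, np.nan, 2.0, np.nan]})

# Mean mort_acc per total_acc value: 3.0 -> 1.0, 5.0 -> 2.0
total_acc_avg = toy.groupby("total_acc")["mort_acc"].mean()

def fill_mort_acc(total_acc, mort_acc):
    # Use the group mean only when mort_acc is missing
    if np.isnan(mort_acc):
        return total_acc_avg[total_acc]
    return mort_acc

toy["mort_acc"] = toy.apply(lambda r: fill_mort_acc(r["total_acc"], r["mort_acc"]), axis=1)
# filled column -> 1.0, 1.0, 2.0, 2.0
```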
TASK: revol_util and pub_rec_bankruptcies have missing data, but it accounts for less than 0.5% of all rows. Drop the rows with these missing values using dropna().
df = df.dropna()
df.isnull().sum()
loan_amnt               0
term                    0
int_rate                0
installment             0
grade                   0
sub_grade               0
home_ownership          0
annual_inc              0
verification_status     0
issue_d                 0
loan_status             0
purpose                 0
dti                     0
earliest_cr_line        0
open_acc                0
pub_rec                 0
revol_bal               0
revol_util              0
total_acc               0
initial_list_status     0
application_type        0
mort_acc                0
pub_rec_bankruptcies    0
address                 0
loan_repaid             0
dtype: int64
With that, the missing data is taken care of! Now we just need to handle the string values in the categorical columns.
TASK: List all the non-numeric columns. Helpful link
# YOUR CODE HERE
df.select_dtypes(include='object')
| | term | grade | sub_grade | home_ownership | verification_status | issue_d | loan_status | purpose | earliest_cr_line | initial_list_status | application_type | address |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 36 months | B | B4 | RENT | Not Verified | Jan-2015 | Fully Paid | vacation | Jun-1990 | w | INDIVIDUAL | 0174 Michelle Gateway\r\nMendozaberg, OK 22690 |
| 1 | 36 months | B | B5 | MORTGAGE | Not Verified | Jan-2015 | Fully Paid | debt_consolidation | Jul-2004 | f | INDIVIDUAL | 1076 Carney Fort Apt. 347\r\nLoganmouth, SD 05113 |
| 2 | 36 months | B | B3 | RENT | Source Verified | Jan-2015 | Fully Paid | credit_card | Aug-2007 | f | INDIVIDUAL | 87025 Mark Dale Apt. 269\r\nNew Sabrina, WV 05113 |
| 3 | 36 months | A | A2 | RENT | Not Verified | Nov-2014 | Fully Paid | credit_card | Sep-2006 | f | INDIVIDUAL | 823 Reid Ford\r\nDelacruzside, MA 00813 |
| 4 | 60 months | C | C5 | MORTGAGE | Verified | Apr-2013 | Charged Off | credit_card | Mar-1999 | f | INDIVIDUAL | 679 Luna Roads\r\nGreggshire, VA 11650 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396025 | 60 months | B | B4 | RENT | Source Verified | Oct-2015 | Fully Paid | debt_consolidation | Nov-2004 | w | INDIVIDUAL | 12951 Williams Crossing\r\nJohnnyville, DC 30723 |
| 396026 | 36 months | C | C1 | MORTGAGE | Source Verified | Feb-2015 | Fully Paid | debt_consolidation | Feb-2006 | f | INDIVIDUAL | 0114 Fowler Field Suite 028\r\nRachelborough, ... |
| 396027 | 36 months | B | B1 | RENT | Verified | Oct-2013 | Fully Paid | debt_consolidation | Mar-1997 | f | INDIVIDUAL | 953 Matthew Points Suite 414\r\nReedfort, NY 7... |
| 396028 | 60 months | C | C2 | MORTGAGE | Verified | Aug-2012 | Fully Paid | debt_consolidation | Nov-1990 | f | INDIVIDUAL | 7843 Blake Freeway Apt. 229\r\nNew Michael, FL... |
| 396029 | 36 months | C | C2 | RENT | Verified | Jun-2010 | Fully Paid | debt_consolidation | Sep-1998 | f | INDIVIDUAL | 787 Michelle Causeway\r\nBriannaton, AR 48052 |
395219 rows × 12 columns
Now to go through the string variables and decide what to do with each one.
TASK: Convert the term column to an integer of 36 or 60 using .apply() or .map().
# YOUR CODE HERE
df.term = df.term.map({' 36 months':36, ' 60 months': 60})
df.term
0 36
1 36
2 36
3 36
4 60
..
396025 60
396026 36
396027 36
396028 60
396029 36
Name: term, Length: 395219, dtype: int64
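The task allows either `.map()` or `.apply()`. As an alternative sketch, `.apply()` can parse the leading number out of each raw string; the sample values below are hypothetical stand-ins for the real column:

```python
import pandas as pd

# Stand-in values mimicking the raw 'term' strings in the dataset
terms = pd.Series([' 36 months', ' 60 months', ' 36 months'])

# split() discards the surrounding whitespace; the first token is the number
parsed = terms.apply(lambda s: int(s.split()[0]))
print(parsed.tolist())  # → [36, 60, 36]
```

Unlike the `.map()` dictionary, this does not depend on reproducing the leading space in `' 36 months'` exactly.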
TASK: We already know that grade is part of sub_grade, so just drop the grade column.
# YOUR CODE HERE
df = df.drop(columns='grade')
TASK: Convert the sub_grade column to dummy variables. Concatenate these new columns onto the original dataframe. Remember to drop the original sub_grade column and to pass drop_first=True to your get_dummies call.
# YOUR CODE HERE
df = pd.concat([df.drop('sub_grade', axis=1), pd.get_dummies(df['sub_grade'], drop_first=True)], axis=1)
df
| loan_amnt | term | int_rate | installment | home_ownership | annual_inc | verification_status | issue_d | loan_status | purpose | ... | F1 | F2 | F3 | F4 | F5 | G1 | G2 | G3 | G4 | G5 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 | 11.44 | 329.48 | RENT | 117000.0 | Not Verified | Jan-2015 | Fully Paid | vacation | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8000.0 | 36 | 11.99 | 265.68 | MORTGAGE | 65000.0 | Not Verified | Jan-2015 | Fully Paid | debt_consolidation | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15600.0 | 36 | 10.49 | 506.97 | RENT | 43057.0 | Source Verified | Jan-2015 | Fully Paid | credit_card | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 7200.0 | 36 | 6.49 | 220.65 | RENT | 54000.0 | Not Verified | Nov-2014 | Fully Paid | credit_card | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 24375.0 | 60 | 17.27 | 609.33 | MORTGAGE | 55000.0 | Verified | Apr-2013 | Charged Off | credit_card | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396025 | 10000.0 | 60 | 10.99 | 217.38 | RENT | 40000.0 | Source Verified | Oct-2015 | Fully Paid | debt_consolidation | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396026 | 21000.0 | 36 | 12.29 | 700.42 | MORTGAGE | 110000.0 | Source Verified | Feb-2015 | Fully Paid | debt_consolidation | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396027 | 5000.0 | 36 | 9.99 | 161.32 | RENT | 56500.0 | Verified | Oct-2013 | Fully Paid | debt_consolidation | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396028 | 21000.0 | 60 | 15.31 | 503.02 | MORTGAGE | 64000.0 | Verified | Aug-2012 | Fully Paid | debt_consolidation | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396029 | 2000.0 | 36 | 13.61 | 67.98 | RENT | 42996.0 | Verified | Jun-2010 | Fully Paid | debt_consolidation | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
395219 rows × 57 columns
TASK: Convert these columns: ['verification_status', 'application_type', 'initial_list_status', 'purpose'] to dummy variables and concatenate them with the original dataframe. Remember to set drop_first=True and to drop the original columns.
# YOUR CODE HERE
df = pd.concat([df.drop(columns=['verification_status'], axis=1), pd.get_dummies(df['verification_status'], drop_first=True)], axis=1)
df = pd.concat([df.drop(columns=['application_type'], axis=1), pd.get_dummies(df['application_type'], drop_first=True)], axis=1)
df = pd.concat([df.drop(columns=['initial_list_status'], axis=1), pd.get_dummies(df['initial_list_status'], drop_first=True)], axis=1)
df = pd.concat([df.drop(columns=['purpose'], axis=1), pd.get_dummies(df['purpose'], drop_first=True)], axis=1)
df
| loan_amnt | term | int_rate | installment | home_ownership | annual_inc | issue_d | loan_status | dti | earliest_cr_line | ... | home_improvement | house | major_purchase | medical | moving | other | renewable_energy | small_business | vacation | wedding | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 | 11.44 | 329.48 | RENT | 117000.0 | Jan-2015 | Fully Paid | 26.24 | Jun-1990 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 8000.0 | 36 | 11.99 | 265.68 | MORTGAGE | 65000.0 | Jan-2015 | Fully Paid | 22.05 | Jul-2004 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15600.0 | 36 | 10.49 | 506.97 | RENT | 43057.0 | Jan-2015 | Fully Paid | 12.79 | Aug-2007 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 7200.0 | 36 | 6.49 | 220.65 | RENT | 54000.0 | Nov-2014 | Fully Paid | 2.60 | Sep-2006 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 24375.0 | 60 | 17.27 | 609.33 | MORTGAGE | 55000.0 | Apr-2013 | Charged Off | 33.95 | Mar-1999 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396025 | 10000.0 | 60 | 10.99 | 217.38 | RENT | 40000.0 | Oct-2015 | Fully Paid | 15.63 | Nov-2004 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396026 | 21000.0 | 36 | 12.29 | 700.42 | MORTGAGE | 110000.0 | Feb-2015 | Fully Paid | 21.45 | Feb-2006 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396027 | 5000.0 | 36 | 9.99 | 161.32 | RENT | 56500.0 | Oct-2013 | Fully Paid | 17.56 | Mar-1997 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396028 | 21000.0 | 60 | 15.31 | 503.02 | MORTGAGE | 64000.0 | Aug-2012 | Fully Paid | 15.88 | Nov-1990 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396029 | 2000.0 | 36 | 13.61 | 67.98 | RENT | 42996.0 | Jun-2010 | Fully Paid | 8.32 | Sep-1998 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
395219 rows × 71 columns
TASK: Review the value counts of the home_ownership column.
# YOUR CODE HERE
df['home_ownership'].value_counts()
MORTGAGE    198022
RENT        159395
OWN          37660
OTHER          110
NONE            29
ANY              3
Name: home_ownership, dtype: int64
TASK: Convert these to dummy variables, but replace NONE and ANY with OTHER so that only four categories remain: MORTGAGE, RENT, OWN, OTHER. Then concatenate them with the original dataframe. Remember to use drop_first=True and to drop the original column.
# YOUR CODE HERE
# A single replace call with a dict covers both NONE and ANY
df['home_ownership'] = df['home_ownership'].replace({'NONE': 'OTHER', 'ANY': 'OTHER'})
df = pd.concat([df.drop(columns=['home_ownership'], axis=1), pd.get_dummies(df['home_ownership'], drop_first=True)], axis=1)
df
| loan_amnt | term | int_rate | installment | annual_inc | issue_d | loan_status | dti | earliest_cr_line | open_acc | ... | medical | moving | other | renewable_energy | small_business | vacation | wedding | OTHER | OWN | RENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 | 11.44 | 329.48 | 117000.0 | Jan-2015 | Fully Paid | 26.24 | Jun-1990 | 16.0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 1 | 8000.0 | 36 | 11.99 | 265.68 | 65000.0 | Jan-2015 | Fully Paid | 22.05 | Jul-2004 | 17.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15600.0 | 36 | 10.49 | 506.97 | 43057.0 | Jan-2015 | Fully Paid | 12.79 | Aug-2007 | 13.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 7200.0 | 36 | 6.49 | 220.65 | 54000.0 | Nov-2014 | Fully Paid | 2.60 | Sep-2006 | 6.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 24375.0 | 60 | 17.27 | 609.33 | 55000.0 | Apr-2013 | Charged Off | 33.95 | Mar-1999 | 13.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396025 | 10000.0 | 60 | 10.99 | 217.38 | 40000.0 | Oct-2015 | Fully Paid | 15.63 | Nov-2004 | 6.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 396026 | 21000.0 | 36 | 12.29 | 700.42 | 110000.0 | Feb-2015 | Fully Paid | 21.45 | Feb-2006 | 6.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396027 | 5000.0 | 36 | 9.99 | 161.32 | 56500.0 | Oct-2013 | Fully Paid | 17.56 | Mar-1997 | 15.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 396028 | 21000.0 | 60 | 15.31 | 503.02 | 64000.0 | Aug-2012 | Fully Paid | 15.88 | Nov-1990 | 9.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396029 | 2000.0 | 36 | 13.61 | 67.98 | 42996.0 | Jun-2010 | Fully Paid | 8.32 | Sep-1998 | 3.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
395219 rows × 73 columns
TASK: Feature-engineer a "zip code" column from the address column.
# YOUR CODE HERE
# Create the new column from the last 5 characters of each address, which hold the zip code
df['zip_code'] = df['address'].str[-5:]
df['zip_code'].head()
0    22690
1    05113
2    05113
3    00813
4    11650
Name: zip_code, dtype: object
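Slicing the last five characters works because every address in this dataset ends with the ZIP code. A slightly more defensive variant (a sketch, using made-up addresses) extracts it with a regex anchored to the end of the string:

```python
import pandas as pd

addresses = pd.Series([
    "0174 Michelle Gateway\r\nMendozaberg, OK 22690",
    "679 Luna Roads\r\nGreggshire, VA 11650",
])

# (\d{5})$ captures exactly five digits at the end of each address
zips = addresses.str.extract(r'(\d{5})$')[0]
print(zips.tolist())  # → ['22690', '11650']
```

Rows that do not end in five digits would come back as NaN instead of silently yielding garbage.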
TASK: Convert the zip_code column into dummy variables using pandas. Concatenate the result and drop the original zip_code column as well as the address column.
df = df.drop(columns='address')
df = pd.concat([df.drop(columns=['zip_code'], axis=1), pd.get_dummies(df['zip_code'], drop_first=True)], axis=1)
df
| loan_amnt | term | int_rate | installment | annual_inc | issue_d | loan_status | dti | earliest_cr_line | open_acc | ... | RENT | 05113 | 11650 | 22690 | 29597 | 30723 | 48052 | 70466 | 86630 | 93700 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 | 11.44 | 329.48 | 117000.0 | Jan-2015 | Fully Paid | 26.24 | Jun-1990 | 16.0 | ... | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8000.0 | 36 | 11.99 | 265.68 | 65000.0 | Jan-2015 | Fully Paid | 22.05 | Jul-2004 | 17.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 15600.0 | 36 | 10.49 | 506.97 | 43057.0 | Jan-2015 | Fully Paid | 12.79 | Aug-2007 | 13.0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 7200.0 | 36 | 6.49 | 220.65 | 54000.0 | Nov-2014 | Fully Paid | 2.60 | Sep-2006 | 6.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 24375.0 | 60 | 17.27 | 609.33 | 55000.0 | Apr-2013 | Charged Off | 33.95 | Mar-1999 | 13.0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396025 | 10000.0 | 60 | 10.99 | 217.38 | 40000.0 | Oct-2015 | Fully Paid | 15.63 | Nov-2004 | 6.0 | ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 396026 | 21000.0 | 36 | 12.29 | 700.42 | 110000.0 | Feb-2015 | Fully Paid | 21.45 | Feb-2006 | 6.0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 396027 | 5000.0 | 36 | 9.99 | 161.32 | 56500.0 | Oct-2013 | Fully Paid | 17.56 | Mar-1997 | 15.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 396028 | 21000.0 | 60 | 15.31 | 503.02 | 64000.0 | Aug-2012 | Fully Paid | 15.88 | Nov-1990 | 9.0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 396029 | 2000.0 | 36 | 13.61 | 67.98 | 42996.0 | Jun-2010 | Fully Paid | 8.32 | Sep-1998 | 3.0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
395219 rows × 81 columns
TASK: This would be data leakage: at prediction time we would not know in advance whether a loan had been issued, so this date would not exist yet. Drop this column.
# YOUR CODE HERE
df = df.drop(columns='issue_d')
TASK: This looks like a historical date. Extract the year from this column using an .apply function, then convert it to a numeric type. Name this column 'earliest_cr_year'. Then drop the earliest_cr_line column.
# YOUR CODE HERE
df['earliest_cr_year'] = df['earliest_cr_line'].apply(lambda x: str(x)[-4:])
df['earliest_cr_year'] = pd.to_numeric(df['earliest_cr_year'], errors='coerce')
df = df.drop(columns='earliest_cr_line')
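Since `earliest_cr_line` follows a `Mon-YYYY` pattern, an equivalent approach (sketched here on hypothetical values) parses the strings as real dates and takes `.dt.year`, which also surfaces any malformed entries immediately:

```python
import pandas as pd

cr_lines = pd.Series(['Jun-1990', 'Jul-2004', 'Aug-2007'])

# format='%b-%Y' matches the abbreviated-month/year layout, e.g. 'Jun-1990'
years = pd.to_datetime(cr_lines, format='%b-%Y').dt.year
print(years.tolist())  # → [1990, 2004, 2007]
```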
TASK: Import train_test_split from sklearn.
from sklearn.model_selection import train_test_split
TASK: Drop the loan_status column created earlier, since it is a duplicate of the loan_repaid column. We will use loan_repaid because it is already encoded as 0s and 1s.
# YOUR CODE HERE
df = df.drop(columns='loan_status')
df.select_dtypes(['object']).count()
Series([], dtype: int64)
TASK: Set up the X and y variables.
# YOUR CODE HERE
X = df.drop(columns='loan_repaid')
y = df['loan_repaid']
X
| loan_amnt | term | int_rate | installment | annual_inc | dti | open_acc | pub_rec | revol_bal | revol_util | ... | 05113 | 11650 | 22690 | 29597 | 30723 | 48052 | 70466 | 86630 | 93700 | earliest_cr_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10000.0 | 36 | 11.44 | 329.48 | 117000.0 | 26.24 | 16.0 | 0.0 | 36369.0 | 41.8 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1990 |
| 1 | 8000.0 | 36 | 11.99 | 265.68 | 65000.0 | 22.05 | 17.0 | 0.0 | 20131.0 | 53.3 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2004 |
| 2 | 15600.0 | 36 | 10.49 | 506.97 | 43057.0 | 12.79 | 13.0 | 0.0 | 11987.0 | 92.2 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2007 |
| 3 | 7200.0 | 36 | 6.49 | 220.65 | 54000.0 | 2.60 | 6.0 | 0.0 | 5472.0 | 21.5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2006 |
| 4 | 24375.0 | 60 | 17.27 | 609.33 | 55000.0 | 33.95 | 13.0 | 0.0 | 24584.0 | 69.8 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1999 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 396025 | 10000.0 | 60 | 10.99 | 217.38 | 40000.0 | 15.63 | 6.0 | 0.0 | 1990.0 | 34.3 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 2004 |
| 396026 | 21000.0 | 36 | 12.29 | 700.42 | 110000.0 | 21.45 | 6.0 | 0.0 | 43263.0 | 95.7 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2006 |
| 396027 | 5000.0 | 36 | 9.99 | 161.32 | 56500.0 | 17.56 | 15.0 | 0.0 | 32704.0 | 66.9 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1997 |
| 396028 | 21000.0 | 60 | 15.31 | 503.02 | 64000.0 | 15.88 | 9.0 | 0.0 | 15704.0 | 53.8 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1990 |
| 396029 | 2000.0 | 36 | 13.61 | 67.98 | 42996.0 | 8.32 | 3.0 | 0.0 | 4292.0 | 91.3 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1998 |
395219 rows × 78 columns
# df = df.sample(frac=0.1,random_state=101)
print(len(df))
395219
TASK: Perform a train/test split with test_size=0.2 and a random_state of 101.
# YOUR CODE HERE
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=101)
X_train
| loan_amnt | term | int_rate | installment | annual_inc | dti | open_acc | pub_rec | revol_bal | revol_util | ... | 05113 | 11650 | 22690 | 29597 | 30723 | 48052 | 70466 | 86630 | 93700 | earliest_cr_year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 372845 | 9700.0 | 36 | 17.27 | 347.14 | 30000.0 | 18.00 | 10.0 | 0.0 | 9753.0 | 51.9 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2006 |
| 8087 | 20000.0 | 36 | 13.65 | 680.16 | 55000.0 | 21.65 | 12.0 | 0.0 | 17144.0 | 41.1 | ... | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 2003 |
| 381049 | 13350.0 | 60 | 17.27 | 333.73 | 55000.0 | 20.53 | 8.0 | 0.0 | 18990.0 | 79.5 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2003 |
| 18334 | 4000.0 | 36 | 9.17 | 127.52 | 65000.0 | 16.38 | 7.0 | 1.0 | 4433.0 | 32.8 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1996 |
| 34030 | 20000.0 | 36 | 8.90 | 635.07 | 230000.0 | 9.32 | 12.0 | 0.0 | 42071.0 | 59.5 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 312547 | 15000.0 | 36 | 14.99 | 519.91 | 46000.0 | 26.04 | 9.0 | 0.0 | 9009.0 | 88.8 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1989 |
| 137037 | 9600.0 | 36 | 18.25 | 348.27 | 46500.0 | 21.60 | 9.0 | 1.0 | 10348.0 | 79.6 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1994 |
| 205041 | 8550.0 | 36 | 9.67 | 274.57 | 46000.0 | 8.35 | 17.0 | 0.0 | 7733.0 | 79.7 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2000 |
| 214988 | 12000.0 | 60 | 16.24 | 293.35 | 45000.0 | 26.16 | 19.0 | 0.0 | 7772.0 | 46.8 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1999 |
| 177347 | 35000.0 | 36 | 19.99 | 1300.55 | 120000.0 | 25.71 | 10.0 | 0.0 | 9763.0 | 40.3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1997 |
316175 rows × 78 columns
TASK: Use MinMaxScaler to normalize the X_train and X_test data. Remember that to avoid leaking information from the test set, the scaler is fit only on the X_train data.
# YOUR CODE HERE
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
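A quick sanity check on toy data: after fitting, every training column spans exactly [0, 1], while transformed test values may fall slightly outside that range because the scaler only ever saw the training minima and maxima:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(42)
X_tr = rng.normal(size=(100, 3))   # toy "training" features
X_te = rng.normal(size=(20, 3))    # toy "test" features

scaler = MinMaxScaler()
X_tr_s = scaler.fit_transform(X_tr)  # fit ONLY on the training data
X_te_s = scaler.transform(X_te)      # reuse the training min/max: no leakage

print(X_tr_s.min(), X_tr_s.max())  # → 0.0 1.0
```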
TASK: Run the cell below to import the necessary Keras functions.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense,Dropout
TASK: Build a sequential model to train on the data. The options here are unlimited, but one suggestion: a model that goes 78 --> 39 --> 19 --> 1 output neuron. OPTIONAL: Explore adding Dropout layers.
# YOUR CODE HERE
model = Sequential()
# Choose whatever number of layers and neurons (units) you want.
# https://stats.stackexchange.com/questions/181/how-to-choose-the-number-of-hidden-layers-and-nodes-in-a-feedforward-neural-netw
# Remember to compile - compile()
model.add(Dense(78, activation = 'relu'))
model.add(Dense(39, activation = 'relu'))
model.add(Dense(19, activation = 'relu'))
model.add(Dense(1, activation = 'sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
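For the optional Dropout exploration, one possible variant inserts a `Dropout` layer after each hidden `Dense` layer. This is a sketch, not a tuned architecture: the 0.2 rate and the name `model_do` are arbitrary choices, and it is built separately so it does not overwrite the `model` above:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

model_do = Sequential([
    Dense(78, activation='relu'),
    Dropout(0.2),   # randomly zeroes 20% of activations during training only
    Dense(39, activation='relu'),
    Dropout(0.2),
    Dense(19, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model_do.compile(loss='binary_crossentropy', optimizer='adam',
                 metrics=['accuracy'])
```

Dropout is disabled automatically at prediction time, so it only acts as a regularizer during training.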
TASK: Fit the model to the training data for a minimum of 25 epochs. Also pass in the validation data for plotting later. Optional: add a batch_size of 256.
# YOUR CODE HERE
history = model.fit(x=X_train, y=y_train, epochs=50, batch_size=256, validation_data=(X_test, y_test))
history
Epoch 1/50 - loss: 0.2854 - accuracy: 0.8805 - val_loss: 0.2740 - val_accuracy: 0.8819
Epoch 2/50 - loss: 0.2626 - accuracy: 0.8880 - val_loss: 0.2639 - val_accuracy: 0.8867
Epoch 3/50 - loss: 0.2605 - accuracy: 0.8882 - val_loss: 0.2668 - val_accuracy: 0.8859
Epoch 4/50 - loss: 0.2591 - accuracy: 0.8884 - val_loss: 0.2631 - val_accuracy: 0.8874
Epoch 5/50 - loss: 0.2583 - accuracy: 0.8888 - val_loss: 0.2631 - val_accuracy: 0.8873
Epoch 6/50 - loss: 0.2578 - accuracy: 0.8888 - val_loss: 0.2616 - val_accuracy: 0.8877
Epoch 7/50 - loss: 0.2576 - accuracy: 0.8887 - val_loss: 0.2629 - val_accuracy: 0.8875
Epoch 8/50 - loss: 0.2571 - accuracy: 0.8890 - val_loss: 0.2612 - val_accuracy: 0.8878
Epoch 9/50 - loss: 0.2566 - accuracy: 0.8891 - val_loss: 0.2617 - val_accuracy: 0.8876
Epoch 10/50 - loss: 0.2562 - accuracy: 0.8892 - val_loss: 0.2613 - val_accuracy: 0.8881
Epoch 11/50 - loss: 0.2558 - accuracy: 0.8894 - val_loss: 0.2622 - val_accuracy: 0.8878
Epoch 12/50 - loss: 0.2555 - accuracy: 0.8894 - val_loss: 0.2614 - val_accuracy: 0.8876
Epoch 13/50 - loss: 0.2552 - accuracy: 0.8896 - val_loss: 0.2615 - val_accuracy: 0.8876
Epoch 14/50 - loss: 0.2548 - accuracy: 0.8895 - val_loss: 0.2631 - val_accuracy: 0.8873
Epoch 15/50 - loss: 0.2545 - accuracy: 0.8898 - val_loss: 0.2620 - val_accuracy: 0.8867
Epoch 16/50 - loss: 0.2542 - accuracy: 0.8897 - val_loss: 0.2628 - val_accuracy: 0.8879
Epoch 17/50 - loss: 0.2538 - accuracy: 0.8900 - val_loss: 0.2622 - val_accuracy: 0.8878
Epoch 18/50 - loss: 0.2536 - accuracy: 0.8901 - val_loss: 0.2661 - val_accuracy: 0.8875
Epoch 19/50 - loss: 0.2532 - accuracy: 0.8902 - val_loss: 0.2616 - val_accuracy: 0.8873
Epoch 20/50 - loss: 0.2529 - accuracy: 0.8902 - val_loss: 0.2626 - val_accuracy: 0.8876
Epoch 21/50 - loss: 0.2527 - accuracy: 0.8905 - val_loss: 0.2678 - val_accuracy: 0.8874
Epoch 22/50 - loss: 0.2524 - accuracy: 0.8906 - val_loss: 0.2631 - val_accuracy: 0.8877
Epoch 23/50 - loss: 0.2520 - accuracy: 0.8906 - val_loss: 0.2631 - val_accuracy: 0.8872
Epoch 24/50 - loss: 0.2517 - accuracy: 0.8909 - val_loss: 0.2632 - val_accuracy: 0.8872
Epoch 25/50 - loss: 0.2514 - accuracy: 0.8911 - val_loss: 0.2642 - val_accuracy: 0.8852
Epoch 26/50 - loss: 0.2511 - accuracy: 0.8910 - val_loss: 0.2629 - val_accuracy: 0.8875
Epoch 27/50 - loss: 0.2509 - accuracy: 0.8911 - val_loss: 0.2630 - val_accuracy: 0.8871
Epoch 28/50 - loss: 0.2506 - accuracy: 0.8909 - val_loss: 0.2642 - val_accuracy: 0.8867
Epoch 29/50 - loss: 0.2502 - accuracy: 0.8913 - val_loss: 0.2635 - val_accuracy: 0.8872
Epoch 30/50 - loss: 0.2500 - accuracy: 0.8912 - val_loss: 0.2654 - val_accuracy: 0.8868
Epoch 31/50 - loss: 0.2498 - accuracy: 0.8911 - val_loss: 0.2652 - val_accuracy: 0.8874
Epoch 32/50 - loss: 0.2495 - accuracy: 0.8918 - val_loss: 0.2670 - val_accuracy: 0.8869
Epoch 33/50 - loss: 0.2493 - accuracy: 0.8915 - val_loss: 0.2644 - val_accuracy: 0.8856
Epoch 34/50 - loss: 0.2489 - accuracy: 0.8918 - val_loss: 0.2653 - val_accuracy: 0.8864
Epoch 35/50 - loss: 0.2486 - accuracy: 0.8920 - val_loss: 0.2654 - val_accuracy: 0.8864
Epoch 36/50 - loss: 0.2484 - accuracy: 0.8921 - val_loss: 0.2666 - val_accuracy: 0.8875
Epoch 37/50 - loss: 0.2483 - accuracy: 0.8920 - val_loss: 0.2678 - val_accuracy: 0.8866
Epoch 38/50 - loss: 0.2481 - accuracy: 0.8921 - val_loss: 0.2665 - val_accuracy: 0.8858
Epoch 39/50 - loss: 0.2477 - accuracy: 0.8921 - val_loss: 0.2670 - val_accuracy: 0.8857
Epoch 40/50 - loss: 0.2474 - accuracy: 0.8923 - val_loss: 0.2664 - val_accuracy: 0.8865
Epoch 41/50 - loss: 0.2472 - accuracy: 0.8925 - val_loss: 0.2657 - val_accuracy: 0.8867
Epoch 42/50 - loss: 0.2470 - accuracy: 0.8924 - val_loss: 0.2680 - val_accuracy: 0.8843
Epoch 43/50 - loss: 0.2468 - accuracy: 0.8924 - val_loss: 0.2691 - val_accuracy: 0.8848
Epoch 44/50 - loss: 0.2467 - accuracy: 0.8928 - val_loss: 0.2686 - val_accuracy: 0.8857
Epoch 45/50 - loss: 0.2464 - accuracy: 0.8927 - val_loss: 0.2698 - val_accuracy: 0.8845
Epoch 46/50 - loss: 0.2461 - accuracy: 0.8930 - val_loss: 0.2693 - val_accuracy: 0.8863
Epoch 47/50 - loss: 0.2460 - accuracy: 0.8928 - val_loss: 0.2713 - val_accuracy: 0.8868
Epoch 48/50 - loss: 0.2459 - accuracy: 0.8929 - val_loss: 0.2695 - val_accuracy: 0.8861
Epoch 49/50 - loss: 0.2456 - accuracy: 0.8932 - val_loss: 0.2734 - val_accuracy: 0.8839
Epoch 50/50 - loss: 0.2453 - accuracy: 0.8934 - val_loss: 0.2681 - val_accuracy: 0.8861
<keras.callbacks.History at 0x17cd9535640>
TASK: OPTIONAL: Save your model.
# YOUR CODE HERE
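A minimal sketch for the optional save task. The filename is arbitrary, and the tiny stand-in network below exists only to keep the snippet self-contained; in the notebook you would simply call `model.save(...)` on the trained network:

```python
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense

# Stand-in for the trained network (use your `model` in the notebook)
demo_model = Sequential([Input(shape=(78,)), Dense(1, activation='sigmoid')])
demo_model.compile(loss='binary_crossentropy', optimizer='adam')

demo_model.save('lending_club_model.h5')        # architecture + weights + optimizer state
restored = load_model('lending_club_model.h5')  # reload later without retraining
```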
TASK: Plot the validation loss versus the training loss.
# YOUR CODE HERE
history.history.keys()
dict_keys(['loss', 'accuracy', 'val_loss', 'val_accuracy'])
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='lower right')
plt.show()
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'validation'], loc='lower right')
plt.show()
TASK: Create predictions from the X_test dataset and display a classification report and confusion matrix for it.
# YOUR CODE HERE
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
testPredict = model.predict(X_test)
testPredict = np.round(abs(testPredict))
testPredict
array([[1.],
[1.],
[1.],
...,
[1.],
[1.],
[0.]], dtype=float32)
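`model.predict` returns sigmoid probabilities in (0, 1), so rounding them is equivalent to thresholding at 0.5; an explicit threshold (a numpy-only sketch on hypothetical probabilities) makes that intent more obvious:

```python
import numpy as np

probs = np.array([[0.91], [0.62], [0.07]])  # hypothetical sigmoid outputs

labels = (probs > 0.5).astype(int)          # 1 = predicted "fully paid"
print(labels.ravel().tolist())  # → [1, 1, 0]
```

Lowering the threshold below 0.5 would trade precision for recall on the minority "charged off" class.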
confusion_matrix(y_test, testPredict)
array([[ 7434, 8224],
[ 780, 62606]])
print(classification_report(y_test, testPredict, zero_division=0))
precision recall f1-score support
0 0.91 0.47 0.62 15658
1 0.88 0.99 0.93 63386
accuracy 0.89 79044
macro avg 0.89 0.73 0.78 79044
weighted avg 0.89 0.89 0.87 79044
TASK: Given the customer shown below, would you offer them a loan?
import random
random.seed(101)
random_ind = random.randint(0,len(df))
nuevo_cliente = df.drop('loan_repaid', axis=1).iloc[random_ind]
nuevo_cliente
loan_amnt 25000.00
term 60.00
int_rate 18.24
installment 638.11
annual_inc 61665.00
...
48052 0.00
70466 0.00
86630 0.00
93700 0.00
earliest_cr_year 1996.00
Name: 305323, Length: 78, dtype: float64
# YOUR CODE HERE
# The model was trained on MinMax-scaled features, so the new customer
# must be transformed with the already-fitted scaler before predicting
model.predict(scaler.transform(nuevo_cliente.values.reshape(1, -1)))
array([[1.]], dtype=float32)
TASK: Now check the original data. Did this customer repay their loan?
# YOUR CODE HERE
df.iloc[random_ind]['loan_repaid']
1
The model predicted this customer correctly.